31.5 Text Mining

One consequence of the apparent reluctance of experimenters in the biological sciences to assign numbers to the phenomena they investigate is that the experimental literature is very wordy and hence voluminous. Indeed, the literature of biology (the “bibliome”)—especially research papers published in journals—has become so vast that even with the aid of review articles that summarize many results within a few pages it is impossible for an individual to keep abreast of it, other than in some very specialized part. Text mining in the first instance merely seeks to automate the search process, by considering, above all, facts uncovered by researchers. Keyword searches, which nowadays can be extended to cover the entire text of a research paper or a book, are straightforward—an instance of string matching (pattern recognition)—but typically the results of such searches are themselves too vast to be humanly processed, and more sophisticated algorithms are required.
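As a concrete illustration of that string-matching step, the following is a minimal sketch in Python (standard library only; the toy corpus and the query term are invented for illustration): it simply reports which documents mention a term and how often, which is all a plain keyword search does.

```python
import re

# Toy corpus standing in for full-text research papers (illustrative only).
documents = {
    "paper_A": "The kinase phosphorylates the receptor, and receptor levels rise.",
    "paper_B": "No change in receptor abundance was observed after treatment.",
    "paper_C": "We measured kinase activity in three cell lines.",
}

def keyword_search(docs, term):
    """Return {doc_id: hit_count} for documents containing `term` (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = {}
    for doc_id, text in docs.items():
        count = len(pattern.findall(text))
        if count:
            hits[doc_id] = count
    return hits

print(keyword_search(documents, "receptor"))
# {'paper_A': 2, 'paper_B': 1}
```

Even this toy example points to the difficulty noted above: the matching itself is trivial, but over millions of full-text papers the resulting list of hits is what overwhelms the reader.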

Automated summarizing is available, based on selecting those sentences in which the most frequent information-containing words occur, but this is generally successful only where the original text is rather simply constructed.
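The frequency-based selection just described can be sketched in a few lines of Python (standard library only; the stop-word list and the example passage are invented for illustration, and a real system would use a proper tokenizer and a much larger stop-word list): each sentence is scored by how frequent its information-carrying words are in the whole text, and the top-scoring sentences, kept in their original order, form the summary.

```python
from collections import Counter
import re

# Tiny stop-word list; real systems use far larger ones (illustrative assumption).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "to", "no", "on", "were"}

def summarize(text, n_sentences=2):
    """Extractive summary: keep the sentences whose words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)  # word frequencies over the whole text
    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Keep the chosen sentences in their original order so the summary reads naturally.
    return " ".join(s for s in sentences if s in top)

abstract = ("Receptor levels increased after kinase activation. "
            "The effect was strongest in liver cells. "
            "Control samples showed no change in receptor levels. "
            "Weather conditions on the day of sampling were mild.")
print(summarize(abstract))
# Receptor levels increased after kinase activation. Control samples showed no change in receptor levels.
```

Because the scoring rests purely on word counts, such a scheme works only when the important sentences really do repeat the text's dominant vocabulary, which is why the method succeeds mainly on simply constructed originals.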

The Holy Grail in the field is the automated inference of semantic information; hence, progress depends on progress in automated natural language processing. Equations, drawings, and photographs pose immense problems at present. Some protagonists even have the ambition to automatically reveal new knowledge in a text, in the sense of ideas not held by the original writer (e.g., hitherto unperceived disease–gene associations).

It would certainly be of tremendous value if automatic text processing could achieve something like this level.12 Research papers could be automatically compared with one another, and contradictions highlighted. This would include not only contradictory facts but also facts contradicting the predictions of hypotheses. Highlighting the absence of appropriate controls, or inadequate evidence from a statistical viewpoint, would also be of great value. In principle, all of this is presently done by individual scientists reading and appraising research papers, even before they are published, through the peer-review process, which ensures, in principle at least, that a paper is read carefully by someone other than the author(s) at least once; papers not meeting acceptable standards should not—again, in principle—be accepted for publication, but the volume of papers being submitted for publication is now too large to make this method rigorously workable. Another difficulty is the already immense and still growing breadth of knowledge required to properly review many papers. One attempt to get over that problem was to start new journals dealing with small subsets of fields, in the hope that if the boundaries are sufficiently narrowly delimited, all relevant information can be taken into account. However, this is a hopeless endeavour: Knowledge is expanding too rapidly and unpredictably for it to be possible to regulate its dissemination in that way. Hence, it is increasingly likely that relevant facts are overlooked (and sometimes useful hypotheses too). Furthermore, the reviewing process is highly fragmented: it is a kind of work that is difficult to divide among different individuals, and the general trend for the number of scientists
12 Cf. the end of the introductory section in Chap. 27.